fix: check instance state on termination failure #2253

Merged
merged 1 commit into from
Aug 4, 2022

dewjam
Contributor

@dewjam dewjam commented Aug 4, 2022

Fixes #2100 and #1796

Description
If tag conditions are used in Karpenter IAM policies for the TerminateInstances action, Karpenter can get stuck in a loop trying to terminate instances that were already terminated (and reaped) without Karpenter's knowledge. This fix adds a step: if instance termination fails, Karpenter calls DescribeInstances to determine the instance's state. If the state is "terminated", or if the instance isn't found, Karpenter skips the TerminateInstances call and proceeds with removing the node in K8S.
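The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `instanceAPI`, `fakeAPI`, `terminate`, `isInstanceTerminated`, `errNotFound`, and `errDenied` are hypothetical names standing in for the EC2 client and Karpenter's termination logic. Only `InstanceTerminatedError` appears in the diff, and its embedded `error` field is an assumption about its body.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only by this sketch.
var (
	errNotFound = errors.New("instance not found")
	errDenied   = errors.New("UnauthorizedOperation: not authorized")
)

// InstanceTerminatedError marks an instance that is already gone.
// The embedded error field is an assumption, not taken from the PR.
type InstanceTerminatedError struct {
	error
}

// isInstanceTerminated reports whether err is (or wraps) an InstanceTerminatedError.
func isInstanceTerminated(err error) bool {
	var target InstanceTerminatedError
	return errors.As(err, &target)
}

// instanceAPI is a hypothetical stand-in for the two EC2 calls involved
// (TerminateInstances and DescribeInstances).
type instanceAPI interface {
	Terminate(id string) error
	State(id string) (string, error) // returns errNotFound if the instance is unknown
}

// terminate tries to terminate the instance. Only if that fails does it look
// up the instance state, so the happy path costs a single API call.
func terminate(api instanceAPI, id string) error {
	termErr := api.Terminate(id)
	if termErr == nil {
		return nil
	}
	state, descErr := api.State(id)
	if errors.Is(descErr, errNotFound) || (descErr == nil && state == "terminated") {
		return InstanceTerminatedError{error: fmt.Errorf("instance %s already terminated", id)}
	}
	// The instance still exists, so surface the original termination failure.
	return fmt.Errorf("terminating instance %s: %w", id, termErr)
}

// fakeAPI simulates an instance whose termination call is always rejected.
// An empty state means the instance no longer exists.
type fakeAPI struct {
	state   string
	termErr error
}

func (f fakeAPI) Terminate(string) error { return f.termErr }

func (f fakeAPI) State(string) (string, error) {
	if f.state == "" {
		return "", errNotFound
	}
	return f.state, nil
}

func main() {
	for _, state := range []string{"running", "terminated", ""} {
		err := terminate(fakeAPI{state: state, termErr: errDenied}, "i-0123456789abcdef0")
		if isInstanceTerminated(err) {
			fmt.Printf("state=%q: instance gone, safe to delete the node\n", state)
		} else {
			fmt.Printf("state=%q: still failing, retry later: %v\n", state, err)
		}
	}
}
```

In this sketch the caller treats `InstanceTerminatedError` as a signal to continue with node removal, while any other error is returned so the reconciler can retry.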

How was this change tested?
Manually reproduced the issue by adding tag conditions to the Karpenter IAM policies for the TerminateInstances action, launching an instance with those tags, and then manually removing the tags from the instance. This simulates the case where a terminated instance no longer has the tags required for Karpenter to terminate it.

Next, manually deleted the node in K8S. This caused Karpenter to repeatedly attempt to terminate the EC2 instance, failing each time with an UnauthorizedOperation error (this is intended behavior).

2022-08-04T14:30:27.498Z	ERROR	controller.controller.termination	Reconciler error	{"commit": "d94b7ac", "reconciler group": "", "reconciler kind": "Node", "name": "ip-192-168-178-4.us-west-2.compute.internal", "namespace": "", "error": "terminating node ip-192-168-178-4.us-west-2.compute.internal, terminating cloudprovider instance, terminating instance ip-192-168-178-4.us-west-2.compute.internal, UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: <message>\n\tstatus code: 403, request id: <id>"}

Finally, manually terminated the EC2 instance, leaving it in the "terminated" state. The next time Karpenter tries to terminate the instance, it also calls DescribeInstances and sees that the instance is "terminated". Karpenter then skips the TerminateInstances call and removes the node object.

2022-08-04T14:32:20.144Z	DEBUG	controller.termination	Instance already terminated, ip-192-168-178-4.us-west-2.compute.internal	{"commit": "d94b7ac", "node": "ip-192-168-178-4.us-west-2.compute.internal"}
2022-08-04T14:32:20.162Z	INFO	controller.termination	Deleted node	{"commit": "d94b7ac", "node": "ip-192-168-178-4.us-west-2.compute.internal"}
2022-08-04T14:32:35.692Z	DEBUG	controller.aws.launchtemplate	Deleted launch template Karpenter-dewaard-karpenter-demo-9688201425030023995 (lt-0a942f88cae582d73)	{"commit": "d94b7ac"}

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

Adds support for deleting the node of an EC2 instance that was manually terminated without Karpenter's knowledge.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dewjam dewjam requested a review from a team as a code owner August 4, 2022 16:04
@dewjam dewjam requested a review from njtran August 4, 2022 16:04

netlify bot commented Aug 4, 2022

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit f2f6b2f
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/62ebee2c2aac74000876aa48

@@ -43,6 +43,10 @@ var (

type SpotFallbackError error

type InstanceTerminatedError struct {
Contributor

can you just make this an error instead of a struct like SpotFallbackError?

Contributor Author

I actually started with that, but I think InstanceTerminatedError then ends up being the *errors.errorString type, which means IsTerminatedError would return true for errors where we expect it to return false. Now that I think of it, I'm not 100% sure the isSpotFallback method is doing what we intend either. Will have to test it further.

Here's an example, btw.
https://go.dev/play/p/cswSjwGQFL2
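
For readers without access to the playground link, here is an independent sketch of the pitfall (it is not a copy of the linked example). `type SpotFallbackError error` declares a named interface type, so a type assertion to it succeeds for any non-nil error, whereas an assertion to a struct type only matches values of that struct. The helper names `isSpotFallback` and `isInstanceTerminated`, the variable `errPlain`, and the embedded `error` field are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// SpotFallbackError is declared as in the diff: a named type whose underlying
// type is the error interface, so its method set is just Error() string.
type SpotFallbackError error

// InstanceTerminatedError is a concrete struct type. The embedded error field
// is an assumption made for this sketch.
type InstanceTerminatedError struct {
	error
}

// errPlain is an unrelated error used to probe both checks.
var errPlain = errors.New("some unrelated failure")

// isSpotFallback asserts to an interface type, which any non-nil error satisfies.
func isSpotFallback(err error) bool {
	_, ok := err.(SpotFallbackError)
	return ok
}

// isInstanceTerminated asserts to a concrete type, which only matches
// values of exactly that type.
func isInstanceTerminated(err error) bool {
	_, ok := err.(InstanceTerminatedError)
	return ok
}

func main() {
	terminated := InstanceTerminatedError{error: errors.New("instance already terminated")}

	fmt.Println(isSpotFallback(errPlain))         // true: false positive
	fmt.Println(isInstanceTerminated(errPlain))   // false
	fmt.Println(isInstanceTerminated(terminated)) // true
}
```

This is why a struct (or another concrete type) is needed when the check is meant to single out one specific kind of error.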

Contributor

o wow, good catch!

Contributor

@bwagner5 bwagner5 left a comment

lgtm
